Football Players Analysis

A Data Science Project by:

PRASFUR TIWARI & RAVI SISTA

Data Analysis

In [4]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

plt.style.use('ggplot')
In [5]:
# importing the data set
df = pd.read_csv('C:\\Users\\smile\\Desktop\\Prasfur\\Project 1 - FIFA\\Data\\FIFA 19.csv')
In [6]:
# exploring the data set
df.head()
Out[6]:
Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94 94 FC Barcelona ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 1 20801 Cristiano Ronaldo 33 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Juventus ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 2 190871 Neymar Jr 26 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 93 Paris Saint-Germain ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 3 193080 De Gea 27 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91 93 Manchester United ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 4 192985 K. De Bruyne 27 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91 92 Manchester City ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M

5 rows × 89 columns

In [7]:
df.shape
Out[7]:
(18207, 89)
In [8]:
# Listing out the columns
df.columns
Out[8]:
Index(['Unnamed: 0', 'ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag',
       'Overall', 'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', 'Loaned From', 'Contract Valid Until',
       'Height', 'Weight', 'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW',
       'LAM', 'CAM', 'RAM', 'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM',
       'CDM', 'RDM', 'RWB', 'LB', 'LCB', 'CB', 'RCB', 'RB', 'Crossing',
       'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
       'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes', 'Release Clause'],
      dtype='object')
In [9]:
# Remove unwanted columns
df.drop(['Unnamed: 0'],axis=1,inplace=True)
In [10]:
# Listing info of the data set
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 88 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        18207 non-null  int64  
 1   Name                      18207 non-null  object 
 2   Age                       18207 non-null  int64  
 3   Photo                     18207 non-null  object 
 4   Nationality               18207 non-null  object 
 5   Flag                      18207 non-null  object 
 6   Overall                   18207 non-null  int64  
 7   Potential                 18207 non-null  int64  
 8   Club                      17966 non-null  object 
 9   Club Logo                 18207 non-null  object 
 10  Value                     18207 non-null  object 
 11  Wage                      18207 non-null  object 
 12  Special                   18207 non-null  int64  
 13  Preferred Foot            18159 non-null  object 
 14  International Reputation  18159 non-null  float64
 15  Weak Foot                 18159 non-null  float64
 16  Skill Moves               18159 non-null  float64
 17  Work Rate                 18159 non-null  object 
 18  Body Type                 18159 non-null  object 
 19  Real Face                 18159 non-null  object 
 20  Position                  18147 non-null  object 
 21  Jersey Number             18147 non-null  float64
 22  Joined                    16654 non-null  object 
 23  Loaned From               1264 non-null   object 
 24  Contract Valid Until      17918 non-null  object 
 25  Height                    18159 non-null  object 
 26  Weight                    18159 non-null  object 
 27  LS                        16122 non-null  object 
 28  ST                        16122 non-null  object 
 29  RS                        16122 non-null  object 
 30  LW                        16122 non-null  object 
 31  LF                        16122 non-null  object 
 32  CF                        16122 non-null  object 
 33  RF                        16122 non-null  object 
 34  RW                        16122 non-null  object 
 35  LAM                       16122 non-null  object 
 36  CAM                       16122 non-null  object 
 37  RAM                       16122 non-null  object 
 38  LM                        16122 non-null  object 
 39  LCM                       16122 non-null  object 
 40  CM                        16122 non-null  object 
 41  RCM                       16122 non-null  object 
 42  RM                        16122 non-null  object 
 43  LWB                       16122 non-null  object 
 44  LDM                       16122 non-null  object 
 45  CDM                       16122 non-null  object 
 46  RDM                       16122 non-null  object 
 47  RWB                       16122 non-null  object 
 48  LB                        16122 non-null  object 
 49  LCB                       16122 non-null  object 
 50  CB                        16122 non-null  object 
 51  RCB                       16122 non-null  object 
 52  RB                        16122 non-null  object 
 53  Crossing                  18159 non-null  float64
 54  Finishing                 18159 non-null  float64
 55  HeadingAccuracy           18159 non-null  float64
 56  ShortPassing              18159 non-null  float64
 57  Volleys                   18159 non-null  float64
 58  Dribbling                 18159 non-null  float64
 59  Curve                     18159 non-null  float64
 60  FKAccuracy                18159 non-null  float64
 61  LongPassing               18159 non-null  float64
 62  BallControl               18159 non-null  float64
 63  Acceleration              18159 non-null  float64
 64  SprintSpeed               18159 non-null  float64
 65  Agility                   18159 non-null  float64
 66  Reactions                 18159 non-null  float64
 67  Balance                   18159 non-null  float64
 68  ShotPower                 18159 non-null  float64
 69  Jumping                   18159 non-null  float64
 70  Stamina                   18159 non-null  float64
 71  Strength                  18159 non-null  float64
 72  LongShots                 18159 non-null  float64
 73  Aggression                18159 non-null  float64
 74  Interceptions             18159 non-null  float64
 75  Positioning               18159 non-null  float64
 76  Vision                    18159 non-null  float64
 77  Penalties                 18159 non-null  float64
 78  Composure                 18159 non-null  float64
 79  Marking                   18159 non-null  float64
 80  StandingTackle            18159 non-null  float64
 81  SlidingTackle             18159 non-null  float64
 82  GKDiving                  18159 non-null  float64
 83  GKHandling                18159 non-null  float64
 84  GKKicking                 18159 non-null  float64
 85  GKPositioning             18159 non-null  float64
 86  GKReflexes                18159 non-null  float64
 87  Release Clause            16643 non-null  object 
dtypes: float64(38), int64(5), object(45)
memory usage: 12.2+ MB
In [11]:
# Checking the number of missing values
df.isnull().sum()
Out[11]:
ID                   0
Name                 0
Age                  0
Photo                0
Nationality          0
                  ... 
GKHandling          48
GKKicking           48
GKPositioning       48
GKReflexes          48
Release Clause    1564
Length: 88, dtype: int64
In [12]:
#Checking extent of null values
sns.heatmap(df.isnull(),yticklabels=False)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d4dc0cc5c8>
In [13]:
# The 'Loaned From' column is almost entirely empty (1264 non-null values out of 18207), so we drop it
df.drop(['Loaned From'],axis = 1, inplace=True)
In [14]:
# Fill missing values in the numeric columns with the mean of that column
df.fillna(df.mean(numeric_only=True),inplace=True)
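As a small illustration of what a mean fill does and does not cover, the sketch below uses a made-up three-row frame (the values are invented for the example, not taken from the data set). Only numeric columns receive a mean; text columns keep their missing values.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'Stamina': [60.0, np.nan, 80.0],
    'Club': ['Juventus', None, 'Napoli'],
})

# Means are computed for the numeric columns only
numeric_means = toy.mean(numeric_only=True)
filled = toy.fillna(numeric_means)

print(filled['Stamina'].tolist())     # numeric gap replaced by the column mean
print(filled['Club'].isnull().sum())  # the text gap is still missing
```

Passing a Series to `fillna` matches its index against the column names, so each numeric column is filled with its own mean.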
In [15]:
df.isnull().sum()
Out[15]:
ID                   0
Name                 0
Age                  0
Photo                0
Nationality          0
                  ... 
GKHandling           0
GKKicking            0
GKPositioning        0
GKReflexes           0
Release Clause    1564
Length: 87, dtype: int64
In [16]:
#Checking extent of null values
sns.heatmap(df.isnull(),yticklabels=False)
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d4db0100c8>
In [17]:
# Some cells are still empty because a mean can only be computed for numeric columns. The remaining gaps are in text columns, so we fill them with the placeholder "Unassigned"
df.fillna("Unassigned",inplace=True)
In [18]:
df.isnull().sum()
Out[18]:
ID                0
Name              0
Age               0
Photo             0
Nationality       0
                 ..
GKHandling        0
GKKicking         0
GKPositioning     0
GKReflexes        0
Release Clause    0
Length: 87, dtype: int64
In [19]:
df.dtypes
Out[19]:
ID                  int64
Name               object
Age                 int64
Photo              object
Nationality        object
                   ...   
GKHandling        float64
GKKicking         float64
GKPositioning     float64
GKReflexes        float64
Release Clause     object
Length: 87, dtype: object
In [20]:
#Displaying all columns
df.keys()
Out[20]:
Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
       'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Jersey Number', 'Joined', 'Contract Valid Until', 'Height', 'Weight',
       'LS', 'ST', 'RS', 'LW', 'LF', 'CF', 'RF', 'RW', 'LAM', 'CAM', 'RAM',
       'LM', 'LCM', 'CM', 'RCM', 'RM', 'LWB', 'LDM', 'CDM', 'RDM', 'RWB', 'LB',
       'LCB', 'CB', 'RCB', 'RB', 'Crossing', 'Finishing', 'HeadingAccuracy',
       'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy',
       'LongPassing', 'BallControl', 'Acceleration', 'SprintSpeed', 'Agility',
       'Reactions', 'Balance', 'ShotPower', 'Jumping', 'Stamina', 'Strength',
       'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
       'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle',
       'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes',
       'Release Clause'],
      dtype='object')
In [21]:
# Final data clean-up procedure
df.drop(['Photo','Flag','Club Logo','Real Face','Special'],axis=1,inplace=True)
df.head()
Out[21]:
ID Name Age Nationality Overall Potential Club Value Wage Preferred Foot ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 158023 L. Messi 31 Argentina 94 94 FC Barcelona €110.5M €565K Left ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 €226.5M
1 20801 Cristiano Ronaldo 33 Portugal 94 94 Juventus €77M €405K Right ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 €127.1M
2 190871 Neymar Jr 26 Brazil 92 93 Paris Saint-Germain €118.5M €290K Right ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 €228.1M
3 193080 De Gea 27 Spain 91 93 Manchester United €72M €260K Right ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 €138.6M
4 192985 K. De Bruyne 27 Belgium 91 92 Manchester City €102M €355K Right ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 €196.4M

5 rows × 82 columns

In [22]:
df.isnull()
Out[22]:
ID Name Age Nationality Overall Potential Club Value Wage Preferred Foot ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
18202 False False False False False False False False False False ... False False False False False False False False False False
18203 False False False False False False False False False False ... False False False False False False False False False False
18204 False False False False False False False False False False ... False False False False False False False False False False
18205 False False False False False False False False False False ... False False False False False False False False False False
18206 False False False False False False False False False False ... False False False False False False False False False False

18207 rows × 82 columns

In [23]:
df.isnull().sum()
Out[23]:
ID                0
Name              0
Age               0
Nationality       0
Overall           0
                 ..
GKHandling        0
GKKicking         0
GKPositioning     0
GKReflexes        0
Release Clause    0
Length: 82, dtype: int64
In [24]:
#Checking extent of null values
sns.heatmap(df.isnull(),yticklabels=False)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d4db043fc8>
In [25]:
# Getting insights from dataset
df.describe()
Out[25]:
ID Age Overall Potential International Reputation Weak Foot Skill Moves Jersey Number Crossing Finishing ... Penalties Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes
count 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 ... 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000 18207.000000
mean 214298.338606 25.122206 66.238699 71.307299 1.113222 2.947299 2.361308 19.546096 49.734181 45.550911 ... 48.548598 58.648274 47.281623 47.697836 45.661435 16.616223 16.391596 16.232061 16.388898 16.710887
std 29965.244204 4.669943 6.908930 6.136496 0.393511 0.659585 0.755167 15.921465 18.340299 19.500064 ... 15.683338 11.421047 19.878141 21.635426 21.261052 17.672007 16.884598 16.481095 17.012198 17.931434
min 16.000000 16.000000 46.000000 48.000000 1.000000 1.000000 1.000000 1.000000 5.000000 2.000000 ... 5.000000 3.000000 3.000000 2.000000 3.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 200315.500000 21.000000 62.000000 67.000000 1.000000 3.000000 2.000000 8.000000 38.000000 30.000000 ... 39.000000 51.000000 30.000000 27.000000 24.000000 8.000000 8.000000 8.000000 8.000000 8.000000
50% 221759.000000 25.000000 66.000000 71.000000 1.000000 3.000000 2.000000 17.000000 54.000000 49.000000 ... 49.000000 59.000000 53.000000 55.000000 52.000000 11.000000 11.000000 11.000000 11.000000 11.000000
75% 236529.500000 28.000000 71.000000 75.000000 1.000000 3.000000 3.000000 26.000000 64.000000 62.000000 ... 60.000000 67.000000 64.000000 66.000000 64.000000 14.000000 14.000000 14.000000 14.000000 14.000000
max 246620.000000 45.000000 94.000000 95.000000 5.000000 5.000000 5.000000 99.000000 93.000000 95.000000 ... 92.000000 96.000000 94.000000 93.000000 91.000000 90.000000 92.000000 91.000000 90.000000 94.000000

8 rows × 42 columns

Data Visualization

In [26]:
# Age distribution of players 
# Histogram: number of players at each age
sns.set(style ="dark", palette="colorblind", color_codes=True)
x = df.Age
plt.figure(figsize=(12,8))
ax = sns.distplot(x, bins = 58, kde = False, color='g')
ax.set_xlabel(xlabel="Player\'s age", fontsize=16)
ax.set_ylabel(ylabel='Number of players', fontsize=16)
ax.set_title(label="Histogram of players' ages", fontsize=20)
plt.show()

oldest = df.loc[df['Age'].idxmax()]
print("The oldest player in FIFA 19 is", df['Age'].max(), "years old. His name is", oldest['Name'], 
      'he is from',oldest['Nationality'],'and plays for',oldest['Club'],'.')

print('The mean age of a player in FIFA 19 is', np.mean(df['Age']))

youngest = df.loc[df['Age'].idxmin()]
print('The youngest player is',df['Age'].min(), "years old. His name is", youngest['Name'], 
      'he is from',youngest['Nationality'],'and plays for',youngest['Club'],'.')
The oldest player in FIFA 19 is 45 years old. His name is O. Pérez he is from Mexico and plays for Pachuca .
The mean age of a player in FIFA 19 is 25.122205745043114
The youngest player is 16 years old. His name is W. Geubbels he is from France and plays for AS Monaco .
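The central value printed above is computed with `np.mean`. Mean and median are different statistics and can diverge when a few extreme values are present. A minimal sketch with invented ages (not the data set's values) shows the difference; on the real column, `df['Age'].median()` would give the median.

```python
import statistics

# Invented ages, for illustration only
ages = [17, 19, 21, 24, 45]

mean_age = statistics.mean(ages)      # pulled upwards by the 45-year-old
median_age = statistics.median(ages)  # middle value of the sorted list

print('mean:', mean_age, 'median:', median_age)
```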
In [27]:
# Overall rating distribution of players
plt.hist(df['Overall'])
plt.xlabel('Players Rating ')
plt.ylabel('Number of players')
plt.show()

best = df.loc[df['Overall'].idxmax()]
print("The best player in FIFA 19 is", df['Overall'].max(), "overall. His name is", best['Name'], 
      'he is from',best['Nationality'],'and plays for',best['Club'],'.')

print('The mean rating of a player in FIFA 19 is', np.mean(df['Overall']))

worst = df.loc[df['Overall'].idxmin()]
print('The worst player is',df['Overall'].min(), "overall. His name is", worst['Name'], 
      'he is from',worst['Nationality'],'and plays for',worst['Club'],'.')
The best player in FIFA 19 is 94 overall. His name is L. Messi he is from Argentina and plays for FC Barcelona .
The mean rating of a player in FIFA 19 is 66.23869940132916
The worst player is 46 overall. His name is G. Nugent he is from England and plays for Tranmere Rovers .
In [28]:
# Distribution of Player's potential rating 
plt.hist(df['Potential'], color = 'blue')
plt.xlabel('Players Potential')
plt.ylabel('Number of players')
plt.show()

bestp = df.loc[df['Potential'].idxmax()]
print("The highest potential rating in FIFA 19 is", df['Potential'].max(), ". His name is", bestp['Name'], 
      'he is from',bestp['Nationality'],'and plays for',bestp['Club'],'.')

print('The mean potential rating of a player in FIFA 19 is', np.mean(df['Potential']))

worstp = df.loc[df['Potential'].idxmin()]
print('The lowest potential rating is',df['Potential'].min(), ". His name is", worstp['Name'], 
      'he is from',worstp['Nationality'],'and plays for',worstp['Club'],'.')
The highest potential rating in FIFA 19 is 95 . His name is K. Mbappé he is from France and plays for Paris Saint-Germain .
The mean potential rating of a player in FIFA 19 is 71.30729939034437
The lowest potential rating is 48 . His name is Y. Uchimura he is from Japan and plays for Hokkaido Consadole Sapporo .
In [29]:
# Displaying the Age vs Overall performance
plt.figure(figsize=(10, 5))
sns.regplot(x=df['Age'], y=df['Overall'])
plt.title('Age vs Overall rating')
plt.show()
In [30]:
# Displaying the Age vs Potential performance
plt.figure(figsize=(10, 5))
sns.regplot(x=df['Age'], y=df['Potential'], color = 'green')
plt.title('Age vs Potential rating')
plt.show()
In [31]:
# Plotting the correlation heatmap of the numeric columns
plt.figure(figsize=(10,8))
sns.heatmap(df.select_dtypes(include='number').corr(),linewidths=3)
plt.title('Correlation Heatmap')
plt.show()
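Pearson correlation is only defined for numeric columns. A minimal sketch, on invented data, of restricting a frame to its numeric columns before computing the correlation matrix that a heatmap would display:

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'Age': [20, 25, 30, 35],
    'Overall': [60, 65, 70, 75],
    'Club': ['A', 'B', 'C', 'D'],  # text column, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()
print(corr)
```

In this toy frame 'Overall' is an exact linear function of 'Age', so their correlation is 1.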
In [32]:
# Preferred foot analysis
plt.figure(figsize=(5,5))
sns.countplot(x=df['Preferred Foot'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x2d4dbb26608>
In [33]:
# Displaying the Top 25 players
df_best_players = df.sort_values(by = 'Overall', ascending = False).head(25).copy()

plt.figure(figsize=(20, 10))
plt.bar('Name' , 'Overall' , data = df_best_players, width=0.5, color = 'Purple')
plt.xlabel('Players names', fontsize=30) 
plt.xticks(rotation = 90,fontsize=20, fontname='monospace')
plt.ylabel('Overall Rating', fontsize=30)
plt.title('Top 25 players Overall Rating', fontsize=40)
plt.ylim(87 , 95)
plt.show()
In [34]:
#Stamina vs Sprint Speed Plot
df.plot(kind = 'scatter' , x='Stamina' , y = 'SprintSpeed' , alpha = .5 )
plt.xlabel('Stamina')
plt.ylabel('Sprint Speed')
plt.title('Stamina-Sprint Speed Scatter Plot')
plt.show()
In [35]:
# Displaying the relation between Age & Sprint Speed
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['SprintSpeed'], color = 'Green')
plt.title('Age vs Sprint Speed')
plt.show()
In [36]:
# Displaying the relation between Overall & Potential
df.groupby('Overall')['Potential'].mean().plot()
plt.title('Overall vs Potential')
plt.ylabel("Potential",rotation=90)
plt.show()
In [37]:
# Count of players by position
plt.figure(1,figsize=(20,10))
p = sns.countplot(x = 'Position', data = df,palette='inferno_r')
p.set_title(label='Count of Players', fontsize=25)
Out[37]:
Text(0.5, 1.0, 'Count of Players')
In [38]:
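# Convert currency strings such as '€110.5M' or '€565K' to plain numbers
def value_to_int(df_value):
    multipliers = {'M': 1000000, 'K': 1000}
    try:
        text = str(df_value).lstrip('€')
        suffix = text[-1:]
        if suffix in multipliers:
            value = float(text[:-1]) * multipliers[suffix]
        else:
            # no suffix: the amount is already in euros
            value = float(text)
    except ValueError:
        # non-numeric entries such as 'Unassigned' are treated as 0
        value = 0
    return value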

df['Value'] = df['Value'].apply(value_to_int)
df['Wage'] = df['Wage'].apply(value_to_int)
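The row-by-row conversion above could also be written with pandas string methods. The sketch below is one possible vectorized variant, not the notebook's method; the helper name `euro_to_number` and the sample values are assumptions made for the example.

```python
import pandas as pd

def euro_to_number(series):
    """Hypothetical helper: convert strings like '€110.5M' to floats."""
    # Split each string into its numeric part and an optional M/K suffix
    parts = series.astype(str).str.extract(
        r'^€(?P<num>\d+(?:\.\d+)?)(?P<suffix>[MK])?$')
    # Missing or empty suffix means the amount is already in euros
    factor = parts['suffix'].map({'M': 1000000, 'K': 1000}).fillna(1)
    numbers = pd.to_numeric(parts['num'], errors='coerce')
    # Strings that do not look like an amount become 0
    return (numbers * factor).fillna(0)

sample = pd.Series(['€110.5M', '€565K', '€0', 'Unassigned'])
converted = euro_to_number(sample)
print(converted.tolist())
```

Treating unparseable entries as 0 mirrors the behaviour of `value_to_int`; returning NaN instead would keep those rows out of sums and means.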
In [39]:
# Displaying the 25 most valuable players
df1 = df.sort_values(by = 'Value', ascending = False).head(25).copy()

plt.figure(figsize=(20, 5))
plt.bar('Name' , 'Value' , data = df1, width=0.5, color = 'Orange')
plt.xlabel('Players names', fontsize=30) 
plt.xticks(rotation = 90,fontsize=20, fontname='monospace')
plt.ylabel('Value (EUR)', fontsize=30)
plt.title('Top 25 Players by Value', fontsize=30)
plt.show()
In [40]:
# Displaying the 25 highest-paid players
df2 = df.sort_values(by = 'Wage', ascending = False).head(25).copy()

plt.figure(figsize=(20, 5))
plt.bar('Name' , 'Wage' , data = df2, width=0.5, color = 'Red')
plt.xlabel('Players names', fontsize=30) 
plt.xticks(rotation = 90,fontsize=20, fontname='monospace')
plt.ylabel('Wage (EUR)', fontsize=30)
plt.title('Top 25 Players by Wage', fontsize=30)
plt.show()
In [41]:
# Distribution of Jersey Number
df['Jersey Number'].plot(kind = 'hist',bins= 320, color= 'blue', label = 'Jersey Number', 
                         alpha = 1.0, grid = True, figsize = (10,5))

plt.legend()
plt.xlabel('Number')
plt.ylabel('Players')
plt.title('Distribution of Jersey Numbers')
plt.show()
In [42]:
# Distribution of Overall of players
df.Overall.plot(kind = 'hist',bins= 400, color= 'green', 
                label = 'Overall', alpha = 1.0, grid = False, figsize = (10,5))

plt.legend()
plt.xlabel('Overall')
plt.ylabel('Players')
plt.title('Distribution of Overall Rating')
plt.show()
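Ratings and jersey numbers are whole numbers, so a histogram with many more bins than distinct values leaves most bins empty. One option, sketched here with invented ratings, is to build one bin per integer by centring the bin edges on the half-integers.

```python
import numpy as np

# Invented ratings, for illustration only
ratings = np.array([46, 60, 60, 61, 94])

# Edges at 45.5, 46.5, ..., 94.5: one bin per integer value
edges = np.arange(ratings.min(), ratings.max() + 2) - 0.5
counts, _ = np.histogram(ratings, bins=edges)

print(len(counts), counts.sum())
```

The same `edges` array can be passed as the `bins` argument of a plotting call, which accepts a sequence of bin edges as well as a bin count.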
In [43]:
# Relation between Shotpower & Finishing
df.plot(kind = 'scatter', x='ShotPower', y='Finishing', alpha = 1.0, color = 'yellow')
plt.xlabel('ShotPower')              
plt.ylabel('Finishing')
plt.title('ShotPower vs Finishing Scatter Plot')
Out[43]:
Text(0.5, 1.0, 'ShotPower vs Finishing Scatter Plot')
In [44]:
# Relation between Composure & Penalties
df.plot(kind = 'scatter', x = 'Composure', y = 'Penalties', alpha = 1.0, color = 'Orange')
plt.xlabel('Composure')
plt.ylabel('Penalties')
plt.title('Composure vs Penalties Scatter plot')
Out[44]:
Text(0.5, 1.0, 'Composure vs Penalties Scatter plot')
In [45]:
# Relation between Ball Control and Short Passing
df.plot(kind = 'scatter', x = 'BallControl', y = 'ShortPassing',alpha = 1.0, color = 'Brown')
plt.xlabel('BallControl')              
plt.ylabel('ShortPassing')
plt.title('BallControl vs ShortPassing Scatter Plot')   
Out[45]:
Text(0.5, 1.0, 'BallControl vs ShortPassing Scatter Plot')
In [46]:
# Relation between Sliding tackle & Interceptions
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['SlidingTackle'], y=df['Interceptions'], color = 'Green')
plt.title('Sliding Tackle vs Interceptions')
plt.show()
In [47]:
# Relation between Gk Diving & GK Positioning
plt.figure(figsize = (10,5))
sns.regplot(x=df['GKDiving'], y=df['GKPositioning'], color = 'Purple')
plt.title('GK Diving vs GK Positioning')
plt.show()
In [48]:
# The clubs and their players overalls
clubs = ('Juventus', 'Real Madrid', 'Paris Saint-Germain', 'FC Barcelona', 'Liverpool',
         'Manchester United', 'FC Bayern München', 'Manchester City', 'Napoli')
plt.figure(figsize = (15,5))

# Keep only the rows whose club is in the list above
df_club = df.loc[df['Club'].isin(clubs)]

ax = sns.barplot(x=df_club['Club'], y=df_club['Overall'], palette="rocket")
ax.set_title(label='Mean overall rating in several clubs', fontsize=20);
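Selecting the rows for a set of clubs only needs the boolean mask returned by `isin`; a numeric column is not a condition and does not belong in the mask. A minimal sketch on invented data, which also computes the per-club mean that `sns.barplot` draws by default:

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'Club': ['Juventus', 'Juventus', 'Napoli', 'Chelsea'],
    'Overall': [94, 88, 85, 90],
})
wanted = ('Juventus', 'Napoli')

# Boolean mask: True where the club is one of the wanted clubs
mask = toy['Club'].isin(wanted)
subset = toy.loc[mask]

# Mean rating per club, i.e. the bar heights of a default barplot
mean_overall = subset.groupby('Club')['Overall'].mean()
print(mean_overall)
```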
In [49]:
# Best player in each position (by Overall rating), with age, club and nationality

df.loc[df.groupby('Position')['Overall'].idxmax(), ['Position', 'Name', 'Age', 'Club', 'Nationality']]
Out[49]:
Position Name Age Club Nationality
17 CAM A. Griezmann 27 Atlético Madrid France
12 CB D. Godín 32 Atlético Madrid Uruguay
20 CDM Sergio Busquets 29 FC Barcelona Spain
271 CF Luis Alberto 25 Lazio Spain
67 CM Thiago 27 FC Bayern München Spain
3 GK De Gea 27 Manchester United Spain
28 LAM J. Rodríguez 26 FC Bayern München Colombia
35 LB Marcelo 30 Real Madrid Brazil
24 LCB G. Chiellini 33 Juventus Italy
11 LCM T. Kroos 28 Real Madrid Germany
14 LDM N. Kanté 27 Chelsea France
5 LF E. Hazard 27 Chelsea Belgium
33 LM P. Aubameyang 29 Arsenal Gabon
21 LS E. Cavani 31 Paris Saint-Germain Uruguay
2 LW Neymar Jr 26 Paris Saint-Germain Brazil
474 LWB N. Schulz 25 TSG 1899 Hoffenheim Germany
129 RAM J. Cuadrado 30 Juventus Colombia
69 RB Azpilicueta 28 Chelsea Spain
8 RCB Sergio Ramos 32 Real Madrid Spain
4 RCM K. De Bruyne 27 Manchester City Belgium
45 RDM P. Pogba 25 Manchester United France
0 RF L. Messi 31 FC Barcelona Argentina
25 RM K. Mbappé 19 Paris Saint-Germain France
7 RS L. Suárez 31 FC Barcelona Uruguay
56 RW Bernardo Silva 23 Manchester City Portugal
450 RWB M. Ginter 24 Borussia Mönchengladbach Germany
1 ST Cristiano Ronaldo 33 Juventus Portugal
5018 Unassigned R. Raldes 37 Unassigned Bolivia
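The table above relies on `idxmax` returning, for each group, the index label of the row with the highest value. A minimal sketch of the pattern on invented data (player names P1 to P4 are placeholders):

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'Name': ['P1', 'P2', 'P3', 'P4'],
    'Position': ['GK', 'GK', 'ST', 'ST'],
    'Overall': [80, 91, 94, 89],
})

# Index label of the top-rated row within each position
best_idx = toy.groupby('Position')['Overall'].idxmax()

# idxmax returns labels, so the rows are looked up with .loc
best = toy.loc[best_idx, ['Position', 'Name', 'Overall']]
print(best)
```

When several players in a group share the top rating, `idxmax` returns the first one it encounters, so ties are not reported.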
In [50]:
# Best features of players
pr_cols=['Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys',
       'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl',
       'Acceleration', 'SprintSpeed', 'Agility', 'Reactions', 'Balance',
       'ShotPower', 'Jumping', 'Stamina', 'Strength', 'LongShots',
       'Aggression', 'Interceptions', 'Positioning', 'Vision', 'Penalties',
       'Composure', 'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving',
       'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']
# Print the name of the highest-rated player for each feature
for col in pr_cols:
    print('Best {0} : {1}'.format(col, df.loc[df[col].idxmax(), 'Name']))
Best Crossing : K. De Bruyne
Best Finishing : L. Messi
Best HeadingAccuracy : Naldo
Best ShortPassing : L. Modrić
Best Volleys : E. Cavani
Best Dribbling : L. Messi
Best Curve : Quaresma
Best FKAccuracy : L. Messi
Best LongPassing : T. Kroos
Best BallControl : L. Messi
Best Acceleration : Douglas Costa
Best SprintSpeed : K. Mbappé
Best Agility : Neymar Jr
Best Reactions : Cristiano Ronaldo
Best Balance : Bernard
Best ShotPower : Cristiano Ronaldo
Best Jumping : Cristiano Ronaldo
Best Stamina : N. Kanté
Best Strength : A. Akinfenwa
Best LongShots : L. Messi
Best Aggression : B. Pearson
Best Interceptions : N. Kanté
Best Positioning : Cristiano Ronaldo
Best Vision : L. Messi
Best Penalties : M. Balotelli
Best Composure : L. Messi
Best Marking : A. Barzagli
Best StandingTackle : G. Chiellini
Best SlidingTackle : Sergio Ramos
Best GKDiving : De Gea
Best GKHandling : J. Oblak
Best GKKicking : M. Neuer
Best GKPositioning : G. Buffon
Best GKReflexes : De Gea
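The same lookup can be done for all features at once with `DataFrame.idxmax`. This is a sketch on invented data; it assumes player names are usable as an index, which may not hold where two players share a name.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'Name': ['P1', 'P2', 'P3'],
    'Finishing': [95, 70, 60],
    'Stamina': [70, 96, 80],
})

# With names as the index, idxmax gives the best player per column
best_by_feature = toy.set_index('Name')[['Finishing', 'Stamina']].idxmax()

for feature, name in best_by_feature.items():
    print('Best {} : {}'.format(feature, name))
```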
In [51]:
# Height of Players

plt.figure(figsize = (13, 8))
ax = sns.countplot(x = 'Height', data = df, palette = 'dark')
ax.set_title(label = 'Count of players on Basis of Height', fontsize = 20)
ax.set_xlabel(xlabel = 'Height in feet and inches', fontsize = 16)
ax.set_ylabel(ylabel = 'Count', fontsize = 16)
plt.show()
In [52]:
# Best player in each position (by Potential rating), with age, club and nationality

df.loc[df.groupby('Position')['Potential'].idxmax(), ['Position', 'Name', 'Age', 'Club', 'Nationality']]
Out[52]:
Position Name Age Club Nationality
31 CAM C. Eriksen 26 Tottenham Hotspur Denmark
42 CB S. Umtiti 24 FC Barcelona France
27 CDM Casemiro 26 Real Madrid Brazil
350 CF A. Milik 24 Napoli Poland
78 CM S. Milinković-Savić 23 Lazio Serbia
3 GK De Gea 27 Manchester United Spain
28 LAM J. Rodríguez 26 FC Bayern München Colombia
35 LB Marcelo 30 Real Madrid Brazil
77 LCB M. Škriniar 23 Inter Slovakia
11 LCM T. Kroos 28 Real Madrid Germany
14 LDM N. Kanté 27 Chelsea France
15 LF P. Dybala 24 Juventus Argentina
415 LM H. Aouar 20 Olympique Lyonnais France
21 LS E. Cavani 31 Paris Saint-Germain Uruguay
2 LW Neymar Jr 26 Paris Saint-Germain Brazil
601 LWB Jonny 24 Wolverhampton Wanderers Spain
171 RAM H. Ziyech 25 Ajax Morocco
247 RB João Cancelo 24 Juventus Portugal
8 RCB Sergio Ramos 32 Real Madrid Spain
4 RCM K. De Bruyne 27 Manchester City Belgium
45 RDM P. Pogba 25 Manchester United France
0 RF L. Messi 31 FC Barcelona Argentina
25 RM K. Mbappé 19 Paris Saint-Germain France
7 RS L. Suárez 31 FC Barcelona Uruguay
79 RW Marco Asensio 22 Real Madrid Spain
766 RWB Pablo Maffeo 20 VfB Stuttgart Spain
1 ST Cristiano Ronaldo 33 Juventus Portugal
13254 Unassigned A. Aidonis 17 VfB Stuttgart Germany
In [53]:
# Mean overall score of players from selected countries

some_countries = ('England', 'Germany', 'Spain', 'Argentina', 'France', 'Brazil', 'Italy', 'Colombia')
df_countries = df.loc[df['Nationality'].isin(some_countries)]

plt.figure(figsize = (10,5))
ax = sns.barplot(x = df_countries['Nationality'], y = df_countries['Overall'], palette = 'colorblind')
ax.set_xlabel(xlabel = 'Countries', fontsize = 15)
ax.set_ylabel(ylabel = 'Overall Scores', fontsize = 15)
ax.set_title(label = 'Mean overall score of players from different countries', fontsize = 20)
plt.show()
In [54]:
# Mean overall score of players in some popular clubs
some_clubs = ('Manchester United', 'Liverpool', 'Juventus', 'Napoli', 'Arsenal', 'Manchester City',
             'Tottenham Hotspur', 'FC Barcelona', 'Valencia CF', 'Chelsea', 'Real Madrid')

data_clubs = df.loc[df['Club'].isin(some_clubs)]

plt.figure(figsize = (20,5))
ax = sns.barplot(x = data_clubs['Club'], y = data_clubs['Overall'], palette = 'deep')
ax.set_xlabel(xlabel = 'Some Popular Clubs', fontsize = 15)
ax.set_ylabel(ylabel = 'Overall Score', fontsize = 15)
ax.set_title(label = 'Mean Overall Score in Different Popular Clubs', fontsize = 20)
plt.show()
In [55]:
# defining the features of players

player_features = ('Acceleration', 'Aggression', 'Agility', 
                   'Balance', 'BallControl', 'Composure', 
                   'Crossing', 'Dribbling', 'FKAccuracy', 
                   'Finishing', 'GKDiving', 'GKHandling', 
                   'GKKicking', 'GKPositioning', 'GKReflexes', 
                   'HeadingAccuracy', 'Interceptions', 'Jumping', 
                   'LongPassing', 'LongShots', 'Marking', 'Penalties')

# Top three features for every position in football

for i, val in df.groupby('Position')[list(player_features)].mean().iterrows():
    print('Position {}: {}, {}, {}'.format(i, *val.nlargest(3).index))
Position CAM: Balance, Agility, Acceleration
Position CB: Jumping, Aggression, HeadingAccuracy
Position CDM: Aggression, Jumping, Balance
Position CF: Agility, Balance, Acceleration
Position CM: Balance, Agility, Acceleration
Position GK: GKReflexes, GKDiving, GKPositioning
Position LAM: Agility, Balance, Acceleration
Position LB: Acceleration, Balance, Agility
Position LCB: Jumping, Aggression, HeadingAccuracy
Position LCM: Balance, Agility, BallControl
Position LDM: Aggression, BallControl, LongPassing
Position LF: Balance, Agility, Acceleration
Position LM: Acceleration, Agility, Balance
Position LS: Acceleration, Agility, Finishing
Position LW: Acceleration, Agility, Balance
Position LWB: Acceleration, Agility, Balance
Position RAM: Agility, Balance, Acceleration
Position RB: Acceleration, Balance, Jumping
Position RCB: Jumping, Aggression, HeadingAccuracy
Position RCM: Agility, Balance, BallControl
Position RDM: Aggression, Jumping, BallControl
Position RF: Agility, Acceleration, Balance
Position RM: Acceleration, Agility, Balance
Position RS: Acceleration, Agility, Jumping
Position RW: Acceleration, Agility, Balance
Position RWB: Acceleration, Agility, Balance
Position ST: Acceleration, Jumping, Finishing
Position Unassigned: Acceleration, Balance, Jumping
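
The ranking above can be sketched on a toy frame as follows. The helper name `top_features` and the sample values are assumptions made for illustration; the technique is the same `groupby(...).mean()` followed by `nlargest` on each row.

```python
import pandas as pd

# Invented sample data: two goalkeepers and two strikers
toy = pd.DataFrame({
    'Position': ['GK', 'GK', 'ST', 'ST'],
    'GKReflexes': [90, 80, 10, 12],
    'GKDiving': [88, 78, 9, 11],
    'Finishing': [12, 14, 90, 84],
    'Acceleration': [40, 50, 88, 80],
})
features = ['GKReflexes', 'GKDiving', 'Finishing', 'Acceleration']

def top_features(frame, feature_cols, k):
    """Return the k features with the highest mean for each position."""
    means = frame.groupby('Position')[feature_cols].mean()
    return {pos: list(row.nlargest(k).index) for pos, row in means.iterrows()}

top2 = top_features(toy, features, 2)
print(top2)
```
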
In [56]:
# Top five most valuable clubs (by total player value)
df.groupby(['Club'])['Value'].sum().sort_values(ascending = False).head()
Out[56]:
Club
Real Madrid          874425000.0
FC Barcelona         852600000.0
Manchester City      786555000.0
Juventus             704475000.0
FC Bayern München    679025000.0
Name: Value, dtype: float64
In [57]:
# Five least valuable clubs (by total player value)
df.groupby(['Club'])['Value'].sum().sort_values().head()
Out[57]:
Club
Unassigned              0.0
Bray Wanderers    1930000.0
Limerick FC       2040000.0
Derry City        2795000.0
Bohemian FC       3195000.0
Name: Value, dtype: float64
In [58]:
# Top five clubs ranked by their highest-rated player
df.groupby(['Club'])['Overall'].max().sort_values(ascending = False).head()
Out[58]:
Club
Juventus               94
FC Barcelona           94
Paris Saint-Germain    92
Manchester United      91
Manchester City        91
Name: Overall, dtype: int64
In [59]:
# Relation between Age & Reactions
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Reactions'], color = 'blue')
plt.title('Age vs Reactions')
plt.show()
In [60]:
# Relation between Age & ShotPower
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['ShotPower'], color = 'green')
plt.title('Age vs ShotPower')
plt.show()
In [61]:
# Relation between Age & Jumping
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Jumping'], color = 'Orange')
plt.title('Age vs Jumping')
plt.show()
In [62]:
# Relation between Age & SprintSpeed
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['SprintSpeed'], color = 'Purple')
plt.title('Age vs SprintSpeed')
plt.show()
In [63]:
# Relation between Age & Stamina
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Stamina'], color = 'Brown')
plt.title('Age vs Stamina')
plt.show()
In [64]:
# Relation between Age & Agility
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Agility'], color = 'Yellow')
plt.title('Age vs Agility')
plt.show()
In [65]:
# Relation between Age & Strength
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Strength'], color = 'aqua')
plt.title('Age vs Strength')
plt.show()
In [66]:
# Relation between Age & Vision
plt.figure(1,figsize=(10,5))
sns.regplot(x=df['Age'], y=df['Vision'], color = 'navy')
plt.title('Age vs Vision')
plt.show()
In [67]:
# Distribution of players according to their Overall
plt.figure(figsize = (20,10))
sns.countplot(x=df['Overall'], palette='rocket')
plt.show()
In [68]:
# Oldest players
df.sort_values(by = 'Age' , ascending = False)[['Name','Club','Nationality','Overall', 'Age' ]].head()
Out[68]:
Name Club Nationality Overall Age
4741 O. Pérez Pachuca Mexico 71 45
18183 K. Pilkington Cambridge United England 48 44
17726 T. Warner Accrington Stanley Trinidad & Tobago 53 44
10545 S. Narazaki Nagoya Grampus Japan 65 42
7225 C. Muñoz CD Universidad de Concepción Argentina 68 41
In [69]:
# Youngest players
df.sort_values(by = 'Age' , ascending = True)[['Name','Club','Nationality','Overall', 'Age' ]].head()
Out[69]:
Name Club Nationality Overall Age
18206 G. Nugent Tranmere Rovers England 46 16
17743 J. Olstad Sarpsborg 08 FF Norway 52 16
13293 H. Massengo AS Monaco France 62 16
16081 J. Italiano Perth Glory Australia 58 16
18166 N. Ayéva Örebro SK Sweden 48 16
In [70]:
# Best Freekick takers
df.sort_values(by = 'FKAccuracy' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','FKAccuracy']].head()
Out[70]:
Name Club Nationality Overall Age FKAccuracy
0 L. Messi FC Barcelona Argentina 94 31 94.0
293 S. Giovinco Toronto FC Italy 82 31 93.0
72 M. Pjanić Juventus Bosnia Herzegovina 86 28 92.0
1113 E. Bardhi Levante UD FYR Macedonia 77 22 91.0
449 H. Çalhanoğlu Milan Turkey 80 24 90.0
In [71]:
# Best Penalty takers
df.sort_values(by = 'Penalties' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','Penalties']].head()
Out[71]:
Name Club Nationality Overall Age Penalties
206 M. Balotelli OGC Nice Italy 83 27 92.0
118 Fabinho Liverpool Brazil 84 24 91.0
16 H. Kane Tottenham Hotspur England 89 24 90.0
823 R. Jiménez Wolverhampton Wanderers Mexico 78 27 90.0
945 L. Baines Everton England 77 33 90.0
In [72]:
# Players with best ball control
df.sort_values(by = 'BallControl' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','BallControl']].head()
Out[72]:
Name Club Nationality Overall Age BallControl
0 L. Messi FC Barcelona Argentina 94 31 96.0
2 Neymar Jr Paris Saint-Germain Brazil 92 26 95.0
30 Isco Real Madrid Spain 88 26 95.0
13 David Silva Manchester City Spain 90 32 94.0
5 E. Hazard Chelsea Belgium 91 27 94.0
In [73]:
# Fastest players
df.sort_values(by = 'SprintSpeed' , ascending = False)[['Name','Club','Nationality','Overall', 'Age','SprintSpeed']].head()
Out[73]:
Name Club Nationality Overall Age SprintSpeed
55 L. Sané Manchester City Germany 86 22 96.0
25 K. Mbappé Paris Saint-Germain France 88 19 96.0
1968 Adama Wolverhampton Wanderers Spain 75 22 96.0
36 G. Bale Real Madrid Wales 88 28 95.0
10928 Maicon Livorno Brazil 65 25 95.0
In [74]:
# Age distribution among famous clubs
clubs = ['Chelsea' , 'Arsenal', 'Juventus', 'Paris Saint-Germain' ,'FC Bayern München',
       'Real Madrid' , 'FC Barcelona' , 'Borussia Dortmund' , 'Manchester United' ,
       'FC Porto', 'Liverpool', 'Manchester City']

club_age = df.loc[df['Club'].isin(clubs)]
plt.figure(1 , figsize = (15 ,7))
sns.boxplot(x = 'Club' , y = 'Age' , data = club_age,palette='rocket')
plt.title('Age Distribution in famous clubs')
plt.xticks(rotation = 50)
plt.show()
In [75]:
# Overall Rating of the clubs
club_rating = df.loc[df['Club'].isin(clubs)]
plt.figure(1 , figsize = (15 ,7))
sns.boxplot(x = 'Club' , y = 'Overall' , data = club_rating, palette='rocket')
plt.title('Overall Rating Distribution in famous clubs')
plt.xticks(rotation = 50)
plt.show()
In [76]:
# Best club
best_dict = {}
for club in df['Club'].unique():
    overall_rating = df['Overall'][df['Club'] == club].sum()
    best_dict[club] = overall_rating
best_club = pd.DataFrame.from_dict(best_dict,orient='index', columns = ['overall'])
best_club['club'] = best_club.index
best_club = best_club.sort_values(by = 'overall' , ascending =  False)

plt.figure(1 , figsize = (15 , 6))
sns.barplot(x = 'club' , y  = 'overall' , data = best_club.head(5),palette='rocket')  
plt.xlabel("Club", size = 15)
plt.ylabel('Sum of Overall Rating of players in club', size = 15)
plt.title('Clubs with best Players (sum of overall ratings of players per club)', size = 25)
plt.ylim(2450 , 2600)
plt.show()
In [77]:
# Wage vs Overall
plt.figure(figsize=(25,10))
sns.barplot(data=df.head(30),y='Wage',x='Overall',palette='rocket')
plt.title('Wage vs Overall',size=35)
plt.xlabel("Overall",size=25)
plt.ylabel('Wage',size=25)
plt.show()
In [78]:
# Wage vs Potential
plt.figure(figsize=(25,10))
sns.barplot(data=df.head(30),y='Wage',x='Potential',palette='rocket')
plt.title('Wage vs Potential',size=35)
plt.xlabel("Potential",size=25)
plt.ylabel('Wage',size=25)
plt.show()

These two plots, which cover only the first 30 players in the dataset, suggest that a player's wage depends more on overall rating than on potential.
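
A more direct check than bar charts of 30 players would be to compare correlations over the whole table. The sketch below uses invented numbers purely to show the call; on the real data one would pass `df[['Overall', 'Potential']]` and `df['Wage']` instead, and no result for the real data is claimed here.

```python
import pandas as pd

# Invented values, chosen only so the output is easy to verify by hand
toy = pd.DataFrame({
    'Wage': [10, 20, 30, 40, 50],
    'Overall': [60, 65, 70, 75, 80],
    'Potential': [80, 70, 85, 75, 90],
})

# Pearson correlation of each rating column with Wage
corr = toy[['Overall', 'Potential']].corrwith(toy['Wage'])
print(corr)
```

In this toy frame `Overall` is perfectly correlated with `Wage`, while `Potential` is only moderately correlated; a similar gap on the full dataset would support the observation above.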

In [79]:
# Count of players for each position
df3=pd.DataFrame(df['Position'].value_counts())
df3.reset_index(inplace=True)
df3.rename(columns={'index':'Position',"Position":"Count"},inplace=True)
df3.head(10)
Out[79]:
Position Count
0 ST 2152
1 GK 2025
2 CB 1778
3 CM 1394
4 LB 1322
5 RB 1291
6 RM 1124
7 LM 1095
8 CAM 958
9 CDM 948
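
The `rename(columns={'index': ...})` step relies on the column names that `reset_index` happens to produce, and those names differ between pandas versions. As a hedged alternative, a small sketch on invented data that names both columns explicitly:

```python
import pandas as pd

# Invented sample of positions
positions = pd.Series(['ST', 'GK', 'ST', 'CB', 'ST', 'GK'], name='Position')

# Name the index and the counts explicitly instead of renaming afterwards
counts = (positions.value_counts()
          .rename_axis('Position')
          .reset_index(name='Count'))
print(counts)
```
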
In [80]:
plt.figure(figsize=(20,10))
sns.countplot(x='Position',data=df)
plt.title("Player count per position",size=25)
plt.xlabel("Position",size=20)
plt.ylabel("Count",size=20)
plt.show()
print("Maximum players were found with position {} with a count of {}".format(df3['Position'][0],df3['Count'][0]))
Maximum players were found with position ST with a count of 2152
In [81]:
import plotly.graph_objects as go
fig=go.Figure(data=go.Scatterpolar(r=df3['Count'].head(),theta=['ST','GK','CB', 'CM','LB'],fill='toself'))
fig.update_layout(polar=dict(radialaxis=dict(visible=True),),showlegend=False)
fig.show()
In [82]:
#Players whose contract has ended (till 2019)
df2=df[df['Contract Valid Until']<='2019'][['ID','Name','Age','Nationality','Contract Valid Until','Position','Club']]
print("Total Contracts ended:",df2.shape[0])
df2[['ID','Name','Age','Nationality','Position','Club']].head(15)
Total Contracts ended: 5705
Out[82]:
ID Name Age Nationality Position Club
12 182493 D. Godín 32 Uruguay CB Atlético Madrid
41 1179 G. Buffon 40 Italy GK Paris Saint-Germain
51 172871 J. Vertonghen 31 Belgium LCB Tottenham Hotspur
86 193747 Koke 26 Spain LM Atlético Madrid
94 184267 Y. Brahimi 28 Algeria LM FC Porto
106 164169 Filipe Luís 32 Brazil LB Atlético Madrid
107 139720 V. Kompany 32 Belgium CB Manchester City
108 120533 Pepe 35 Portugal RCB Beşiktaş JK
116 211300 A. Martial 22 France LW Manchester United
152 137186 A. Barzagli 37 Italy CB Juventus
154 9014 A. Robben 34 Netherlands RM FC Bayern München
168 210008 A. Rabiot 23 France CM Paris Saint-Germain
206 186627 M. Balotelli 27 Italy LS OGC Nice
209 179944 David Luiz 31 Brazil LCB Chelsea
211 178088 Juan Mata 30 Spain RM Manchester United
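
The filter above compares strings, so it only works when every entry is a bare four-digit year. If the column also held full dates (an assumption about the raw data, for example 'Jun 30, 2019'), those rows would be skipped, because 'J' sorts after '2'. A minimal sketch, on invented values, of extracting the year first:

```python
import pandas as pd

# Invented examples mixing bare years and full dates
contracts = pd.Series(['2019', '2021', 'Jun 30, 2019', 'Dec 31, 2018'])

# Pull out the first four-digit number and convert it to a numeric year
year = pd.to_numeric(contracts.str.extract(r'(\d{4})', expand=False), errors='coerce')
ended = year <= 2019
print(ended.tolist())
```
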
In [83]:
# Players' age when their contract ended (2019)
plt.figure(figsize=(20,10))
sns.barplot(data=df2.head(15),x='Name',y='Age')
plt.title('Age of players when their contract ended (top 15)',size=20)
plt.xlabel("Names",size=25)
plt.ylabel('Age',size=25)
plt.show()
print("Average age of players whose contract ended:",int(df2['Age'].mean()))
Average age of players whose contract ended: 26
In [84]:
#Contracts ended per year
plt.figure(figsize=(5,5))
sns.countplot(x=df2['Contract Valid Until'])
plt.title('No. of contracts ended in previous years',size=15)
plt.xlabel("Years",size=15)
plt.ylabel('Count',size=15)
plt.show()
# Look the counts up by year label rather than by position in the sorted result
print("Contracts ended in 2019:",df2['Contract Valid Until'].value_counts()['2019'])
print("Contracts ended in 2018:",df2['Contract Valid Until'].value_counts()['2018'])
Contracts ended in 2019: 4819
Contracts ended in 2018: 886
In [85]:
# Contracts ended for various positions
d=pd.DataFrame(df2['Position'].value_counts())
d.reset_index(inplace=True)
d.rename(columns={"index":"Position","Position":"Count"},inplace=True)
d.head()
Out[85]:
Position Count
0 GK 738
1 ST 630
2 CB 596
3 RB 448
4 CM 444
In [86]:
plt.figure(figsize=(20,10))
sns.countplot(x=df2['Position'])
plt.title('No. of contracts ended per position',size=20)
plt.xlabel("Position",size=15)
plt.ylabel('Count',size=15)
plt.show()
print("Most contracts ended for position {} and their count is: {}".format(d['Position'][0],d['Count'][0]))
Most contracts ended for position GK and their count is: 738
In [87]:
#Contracts ended per club
d1=pd.DataFrame(df2['Club'].value_counts())
d1.reset_index(inplace=True)
d1.rename(columns={"index":"Club","Club":"Count"},inplace=True)
d1.head(10)
Out[87]:
Club Count
0 Bohemian FC 25
1 Montreal Impact 24
2 Waterford FC 24
3 Cork City 23
4 Bray Wanderers 23
5 De Graafschap 23
6 FC Emmen 22
7 HJK Helsinki 22
8 Exeter City 21
9 St. Patrick's Athletic 20
In [88]:
plt.figure(figsize=(15,10))
plt.bar('Club','Count',data=d1.head(10),width=0.7,color='red')
plt.title('No. of contracts ended for clubs (top 10)',size=20)
plt.xlabel("Clubs",size=25)
plt.xticks(rotation = 90,fontsize=15, fontname='sans-serif')
plt.ylabel('Count',size=25)
plt.show()
print("\nMost contracts ended for club {} and its count is: {}".format(d1['Club'][0],d1['Count'][0]))
Most contracts ended for club Bohemian FC and its count is: 25
In [89]:
df1=pd.DataFrame(df['Nationality'].value_counts())
df1.reset_index(inplace=True)
df1.rename(columns={'index':"Nation","Nationality":"Count"},inplace=True)
df1.head()
Out[89]:
Nation Count
0 England 1662
1 Germany 1198
2 Spain 1072
3 Argentina 937
4 France 914
In [90]:
# Most players from a country
plt.figure(figsize=(25,10))
sns.barplot(data=df1.head(20),x='Nation',y='Count',palette='rocket')
plt.title('Most players from a country (top 20)',size=35)
plt.xlabel("Nations",size=25)
plt.ylabel('Count',size=25)
plt.show()
In [91]:
# Pie chart depiction of countries with most players
fig1,ax1=plt.subplots(figsize=(8,8))
plt.subplots_adjust(left=0.5,wspace=0.2)
ax1.pie(df1['Count'].head(7),explode=(0.1,0,0,0,0,0,0,),labels=df1['Nation'].head(7),autopct='%1.1f%%',shadow=True,startangle=90)
ax1.axis('equal')
plt.legend(df1['Nation'].head(7),loc="best")
plt.tight_layout()
plt.show()

This shows that England and Germany contribute the most footballers in the dataset.
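
Note that the pie chart percentages are relative to the seven countries plotted, not to all players. A minimal sketch, on invented data, of computing each country's share of the whole table with `value_counts(normalize=True)`:

```python
import pandas as pd

# Invented nationality column: 5 + 3 + 2 = 10 players
nationality = pd.Series(['England'] * 5 + ['Germany'] * 3 + ['Spain'] * 2)

# Share of all players per country, as a percentage
share = nationality.value_counts(normalize=True).mul(100).round(1)
print(share)
```
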

In [92]:
import folium
from geopy.geocoders import Nominatim
import requests
In [93]:
lat=[]
lng=[]
geolocator=Nominatim(user_agent='foursquare_api')
for nm in df1['Nation'].head(20):
    location=geolocator.geocode(str(nm),timeout=10)
    lat.append(location.latitude)
    lng.append(location.longitude)
In [94]:
for_map=pd.DataFrame(df1.head(20))
for_map['Latitude']=lat
for_map['Longitude']=lng
for_map.head()
Out[94]:
Nation Count Latitude Longitude
0 England 1662 52.795479 -0.540240
1 Germany 1198 51.083420 10.423447
2 Spain 1072 39.326234 -4.838065
3 Argentina 937 -34.996496 -64.967282
4 France 914 46.603354 1.888334
In [95]:
wm=folium.Map(zoom_start=2,location=[0,0])
mp=folium.map.FeatureGroup()
for i,j,k in zip(for_map['Latitude'],for_map['Longitude'],for_map['Count']):
    mp.add_child(folium.CircleMarker(location=[i,j],radius=5,color='red',fill_color='Yellow'))
    folium.Marker([i,j],popup='Player Count '+str(k)).add_to(mp)
wm.add_child(mp)

wm

Data Modelling

1. Regression

1.1 To find 'Overall' of footballers

Here, we use a selection of player attributes to predict 'Overall', the rating that summarises a player's total performance.

In [96]:
predictors=df[['Overall','Potential','Value','Wage','Skill Moves',
'Crossing','Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling',
'Curve','FKAccuracy','LongPassing','BallControl','Acceleration','SprintSpeed',
'Agility','Reactions','Balance','ShotPower','Jumping','Stamina','Strength',
'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle',
'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']]
predictors.head()
Out[96]:
Overall Potential Value Wage Skill Moves Crossing Finishing HeadingAccuracy ShortPassing Volleys ... Penalties Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes
0 94 94 110500000.0 565000.0 4.0 84.0 95.0 70.0 90.0 86.0 ... 75.0 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0
1 94 94 77000000.0 405000.0 5.0 84.0 94.0 89.0 81.0 87.0 ... 85.0 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0
2 92 93 118500000.0 290000.0 5.0 79.0 87.0 62.0 84.0 84.0 ... 81.0 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0
3 91 93 72000000.0 260000.0 1.0 17.0 13.0 21.0 50.0 13.0 ... 40.0 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0
4 91 92 102000000.0 355000.0 4.0 93.0 82.0 55.0 92.0 82.0 ... 79.0 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0

5 rows × 39 columns

The regression plots above show roughly linear trends between attributes, and we assume the predictor columns likewise vary approximately linearly with 'Overall'. Hence we select the multiple linear regression algorithm.

Importing needed modules

In [97]:
from sklearn import linear_model
from sklearn.metrics import r2_score

Creating training and testing datasets

In [98]:
ms=np.random.rand(len(df))<0.75
train=predictors[ms]
test=predictors[~ms]
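
Because the mask above is drawn without a seed, the split (and every metric below) changes on each run. A minimal sketch of the same mask-based split made repeatable; the helper name `split_mask`, the seed value, and the toy frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def split_mask(n_rows, train_fraction=0.75, seed=42):
    """Boolean mask marking training rows; the seed value is arbitrary."""
    rng = np.random.default_rng(seed)
    return rng.random(n_rows) < train_fraction

# Toy frame standing in for `predictors`
toy = pd.DataFrame({'Overall': np.arange(100), 'Potential': np.arange(100) + 1})
mask = split_mask(len(toy))
train_toy, test_toy = toy[mask], toy[~mask]
print(len(train_toy), len(test_toy))
```
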
In [99]:
x=np.asanyarray(train[['Potential','Value','Wage','Skill Moves','Crossing',
'Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling',
'Curve','FKAccuracy','LongPassing','BallControl','Acceleration','SprintSpeed',
'Agility','Reactions','Balance','ShotPower','Jumping','Stamina','Strength',
'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle',
'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']])
y=np.asanyarray(train['Overall'])
In [100]:
x
Out[100]:
array([[9.400e+01, 1.105e+08, 5.650e+05, ..., 1.500e+01, 1.400e+01,
        8.000e+00],
       [9.400e+01, 7.700e+07, 4.050e+05, ..., 1.500e+01, 1.400e+01,
        1.100e+01],
       [9.300e+01, 1.185e+08, 2.900e+05, ..., 1.500e+01, 1.500e+01,
        1.100e+01],
       ...,
       [6.300e+01, 6.000e+04, 1.000e+03, ..., 9.000e+00, 5.000e+00,
        1.200e+01],
       [6.700e+01, 6.000e+04, 1.000e+03, ..., 1.000e+01, 6.000e+00,
        1.300e+01],
       [6.600e+01, 6.000e+04, 1.000e+03, ..., 9.000e+00, 1.200e+01,
        9.000e+00]])
In [101]:
y
Out[101]:
array([94, 94, 92, ..., 47, 47, 46], dtype=int64)

Fitting the model

In [102]:
regr=linear_model.LinearRegression()
regr.fit(x,y)
Out[102]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [103]:
print("Coefficients:",regr.coef_)
print("\nIntercept:",regr.intercept_)
Coefficients: [ 2.10627489e-01  8.75778492e-08  6.24602506e-06  9.76594170e-01
  3.51105247e-02  1.64723949e-02  7.31253206e-02  4.99534810e-02
 -3.19766787e-03 -1.73395578e-02  1.00924451e-03  9.01523418e-03
 -6.34858080e-03  1.01765810e-01  1.01395955e-02  1.35312230e-02
  1.54747760e-02  2.27641840e-01 -1.61914151e-02  1.60970659e-02
  8.97213032e-03  1.91617039e-02  4.72023743e-02 -5.94666950e-03
  1.17998109e-02  1.16937932e-02 -3.47153836e-02 -2.52632988e-02
  6.85185081e-03  9.44627274e-02  2.97385569e-02  7.27159395e-03
 -2.21729650e-02  5.74494470e-02  5.80808423e-02  2.74748760e-02
  6.88187245e-02  6.15259115e-02]

Intercept: 4.169329184804496

Prediction

In [104]:
Y=regr.predict(test[['Potential','Value','Wage','Skill Moves','Crossing',
'Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling',
'Curve','FKAccuracy','LongPassing','BallControl','Acceleration','SprintSpeed',
'Agility','Reactions','Balance','ShotPower','Jumping','Stamina','Strength',
'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle',
'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']])

Comparison

In [105]:
x1=np.asanyarray(test[['Potential','Value','Wage','Skill Moves','Crossing',
'Finishing','HeadingAccuracy','ShortPassing','Volleys','Dribbling',
'Curve','FKAccuracy','LongPassing','BallControl','Acceleration','SprintSpeed',
'Agility','Reactions','Balance','ShotPower','Jumping','Stamina','Strength',
'LongShots', 'Aggression', 'Interceptions', 'Positioning', 'Vision',
'Penalties', 'Composure', 'Marking', 'StandingTackle', 'SlidingTackle',
'GKDiving', 'GKHandling', 'GKKicking', 'GKPositioning', 'GKReflexes']])
y1=np.asanyarray(test['Overall'])
In [106]:
a={"Predicted Values":Y,"Actual Values":y1}
comp=pd.DataFrame(a)
comp=comp.astype({"Predicted Values":'int64'})
comp.set_index(["Predicted Values","Actual Values"],inplace=True)
comp.head()
Out[106]:
Predicted Values Actual Values
98 91
94 90
91 90
89
93 89

Model Evaluation

Accuracy Measurement

In [107]:
# np.mean of the squared residuals is the mean squared error (MSE)
f1=np.mean((Y-y1)**2)
print("Mean squared error: {}".format(f1))
Mean squared error: 4.863003613128007
In [108]:
f2=regr.score(x1,y1)
print("Variance Score:",f2)
Variance Score: 0.9003073364619885
In [109]:
f3=np.mean(np.absolute(Y-y1))
print("Mean absolute error:",f3)
Mean absolute error: 1.736450970043051
In [110]:
# Note: r2_score expects (y_true, y_pred); the arguments are reversed here,
# which is why this value differs from regr.score(x1, y1) above
f4=r2_score(Y,y1)
print("R2-Score:",f4)
print("R2-Score Percentage",round(f4*100,3),"%")
R2-Score: 0.8885683150443772
R2-Score Percentage 88.857 %
In [112]:
from sklearn import metrics
# The square root of the MSE is the root mean squared error (RMSE)
f5=np.sqrt(metrics.mean_squared_error(Y,y1))
print("RMSE:",f5)
RMSE: 2.2052218965736774

Model Summary

In [113]:
a={"Feature":["Mean Squared Error","Variance Score","Mean Absolute Error","Root Mean Squared Error","R2-Score Percentage"],
  "Value":[round(f1,3),round(f2,3),round(f3,3),round(f5,3),round(f4*100,3)]}
summary=pd.DataFrame(a)
summary.set_index(['Feature','Value'],inplace=True)
summary
Out[113]:
Feature Value
Mean Squared Error 4.863
Variance Score 0.900
Mean Absolute Error 1.736
Root Mean Squared Error 2.205
R2-Score Percentage 88.857
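
For reference, the evaluation measures can be written out in plain NumPy. This is a sketch on invented numbers, with `regression_metrics` a hypothetical helper name; it also illustrates that R2 is not symmetric in its two arguments, so the order (true values first, predictions second) matters.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R2 for true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        'mse': mse,
        'rmse': np.sqrt(mse),
        'mae': np.mean(np.abs(resid)),
        'r2': 1 - ss_res / ss_tot,
    }

actual = [90, 80, 70, 60]       # invented ratings
predicted = [88, 82, 71, 59]    # invented predictions
m = regression_metrics(actual, predicted)
m_swapped = regression_metrics(predicted, actual)
print(m['r2'], m_swapped['r2'])
```
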

2. Clustering

Here, we cluster the players by their potential and overall performance, so that each group can be assigned a label.

We will use the k-means clustering algorithm.

In [114]:
from sklearn.cluster import KMeans

We aim to cluster the players into 4 major categories:

  1. Best
  2. Good
  3. Average
  4. Below Average

In [115]:
predictors1=df[['Overall','Potential']]

Data Normalization/Standardization

In [116]:
from sklearn.preprocessing import StandardScaler
# Note: [:,1:] keeps only the 'Potential' column, so the clustering below uses Potential alone
X=predictors1.values[:,1:]
clus_dataset=StandardScaler().fit_transform(X)
In [117]:
# Original Values
X
Out[117]:
array([[94],
       [94],
       [93],
       ...,
       [67],
       [66],
       [66]], dtype=int64)
In [118]:
# Normalized Values
clus_dataset
Out[118]:
array([[ 3.69809177],
       [ 3.69809177],
       [ 3.53512784],
       ...,
       [-0.70193445],
       [-0.86489839],
       [-0.86489839]])
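
Standardization subtracts the column mean and divides by the column standard deviation, so each feature ends up with mean 0 and unit variance. A minimal NumPy sketch of that formula on invented values (`standardize` is a hypothetical helper; it uses the population standard deviation, as `StandardScaler` does):

```python
import numpy as np

def standardize(column):
    """z = (x - mean) / std, computed per column with population std (ddof=0)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean(axis=0)) / column.std(axis=0)

z = standardize([[60], [70], [80]])   # invented ratings
print(z)
```
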

Fitting the Model

In [119]:
kMeans=KMeans(init='k-means++',n_clusters=4)
kMeans.fit(clus_dataset)
Out[119]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=4, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
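
With `random_state=None`, the numbering of the clusters can differ from run to run. The sketch below, on invented points, fixes the seed and clusters on two columns at once; the seed value and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented (Overall, Potential) pairs forming three well-separated groups
toy = np.array([[60, 62], [61, 63], [75, 80], [76, 81], [90, 92], [91, 93]])

scaled = StandardScaler().fit_transform(toy)
model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
toy_labels = model.fit_predict(scaled)
print(toy_labels)
```
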
In [120]:
# Generating labels
labels=kMeans.labels_
In [121]:
predictors1['Labels']=labels
df['Labels']=labels

Cluster 1

In [122]:
cluster1=df.loc[df['Labels']==0,['Name','Overall','Potential','Labels']]
pos1=cluster1['Potential'].mean()
ov1=cluster1['Overall'].mean()
mem1=cluster1.shape[0]
print("Mean Potential of cluster:",pos1)
print("Mean Overall of cluster:",ov1)
print("Members in cluster 1:",mem1)
cluster1.head()
Mean Potential of cluster: 68.60502936304773
Mean Overall of cluster: 64.44285499171812
Members in cluster 1: 6641
Out[122]:
Name Overall Potential Labels
3968 Allison Sireo 71 71 0
3969 Z. Stieber 71 71 0
3971 O. Skúlason 71 71 0
3973 M. Pektemek 71 71 0
3974 R. Schüller 71 71 0

Cluster 2

In [123]:
cluster2=df.loc[df['Labels']==1,['Name','Overall','Potential','Labels']]
pos2=cluster2['Potential'].mean()
ov2=cluster2['Overall'].mean()
mem2=cluster2.shape[0]
print("Mean Potential of cluster:",pos2)
print("Mean Overall of cluster:",ov2)
print("Members in cluster 2:",mem2)
cluster2.head()
Mean Potential of cluster: 81.20923722883136
Mean Overall of cluster: 73.55213435969209
Members in cluster 2: 2858
Out[123]:
Name Overall Potential Labels
0 L. Messi 94 94 1
1 Cristiano Ronaldo 94 94 1
2 Neymar Jr 92 93 1
3 De Gea 91 93 1
4 K. De Bruyne 91 92 1

Cluster 3

In [124]:
cluster3=df.loc[df['Labels']==2,['Name','Overall','Potential','Labels']]
pos3=cluster3['Potential'].mean()
ov3=cluster3['Overall'].mean()
mem3=cluster3.shape[0]
print("Mean Potential of cluster:",pos3)
print("Mean Overall of cluster:",ov3)
print("Members in cluster 3:",mem3)
cluster3.head()
Mean Potential of cluster: 74.2297538351766
Mean Overall of cluster: 68.18444523724581
Members in cluster 3: 5606
Out[124]:
Name Overall Potential Labels
895 M. Harnik 77 77 2
896 B. Moukandjo 77 77 2
898 G. Castro 77 77 2
900 F. Johnson 77 77 2
901 L. López 77 77 2

Cluster 4

In [125]:
cluster4=df.loc[df['Labels']==3,['Name','Overall','Potential','Labels']]
pos4=cluster4['Potential'].mean()
ov4=cluster4['Overall'].mean()
mem4=cluster4.shape[0]
print("Mean Potential of cluster:",pos4)
print("Mean Overall of cluster:",ov4)
print("Members in cluster 4:",mem4)
cluster4.head()
Mean Potential of cluster: 62.687943262411345
Mean Overall of cluster: 59.82882011605416
Members in cluster 4: 3102
Out[125]:
Name Overall Potential Labels
9928 J. Akinde 65 65 3
9929 D. McGregor 65 65 3
9932 Teixeira José 65 65 3
9933 A. Fernández 65 65 3
9938 A. Considine 65 65 3

Ranking the clusters by their mean Potential and Overall shown above, we assign the following tags. Because random_state is not set, the cluster numbering can change between runs, so the tags must be re-derived from the cluster means each time.

  1. Cluster 1: Average
  2. Cluster 2: Best
  3. Cluster 3: Good
  4. Cluster 4: Below Average

Summary

In [128]:
clus={"Index":[1,2,3,4],"Tags":["Best","Good","Average","Below Average"],
   "Potential Mean":[round(pos2,3),round(pos3,3),round(pos1,3),round(pos4,3)],
   "Overall Mean":[round(ov2,3),round(ov3,3),round(ov1,3),round(ov4,3)],
   "Players":[mem2,mem3,mem1,mem4],"Cluster":[2,3,1,4]}
clus_sum=pd.DataFrame(clus)
clus_sum.set_index(['Index'],inplace=True)
clus_sum
Out[128]:
Tags Potential Mean Overall Mean Players Cluster
Index
1 Best 81.209 73.552 2858 2
2 Good 74.230 68.184 5606 3
3 Average 68.605 64.443 6641 1
4 Below Average 62.688 59.829 3102 4
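
Since k-means numbers its clusters arbitrarily, the tags are safer to derive from the cluster means than to type in by hand. A minimal sketch on invented data, ranking the labels by mean Potential and mapping them to the four tags (`label_to_tag` is a name introduced here for illustration):

```python
import pandas as pd

# Invented players with k-means style labels 0..3
toy = pd.DataFrame({
    'Potential': [94, 93, 77, 76, 71, 70, 65, 64],
    'Labels':    [1, 1, 2, 2, 0, 0, 3, 3],
})
tags = ['Best', 'Good', 'Average', 'Below Average']

# Labels ordered from highest to lowest mean Potential
order = (toy.groupby('Labels')['Potential'].mean()
         .sort_values(ascending=False).index)
label_to_tag = dict(zip(order, tags))
toy['Tag'] = toy['Labels'].map(label_to_tag)
print(label_to_tag)
```
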

Visualizing Clusters

In [129]:
sns.set_style('whitegrid')
sns.lmplot(x='Overall',y='Potential',data=df,hue='Labels',palette="husl",height=6,aspect=1,fit_reg=False)
Out[129]:
<seaborn.axisgrid.FacetGrid at 0x2d4f3ce23c8>